Abstract

The counting grid is a grid of microtopics, i.e., sparse word/feature distributions. The generative model associated with the grid does not use these microtopics individually, but in predefined groups which can only be (ad)mixed as such. Each allowed group corresponds to one of all possible overlapping rectangular windows into the grid. The capacity of the model is controlled by the ratio of the grid size and the window size. This paper builds upon the basic counting grid model and shows that hierarchical reasoning helps avoid bad local minima, produces better classification accuracy and, most interestingly, allows for extraction of large numbers of coherent microtopics even from small datasets. We evaluate this in terms of consistency, diversity and clarity of the indexed content, as well as in a user study on word intrusion tasks. We demonstrate that these models work well as a technique for embedding raw images and discuss interesting parallels between hierarchical CG models and other deep architectures.


1 INTRODUCTION 

Recently, a new breed of topic models, dubbed counting grids (CG) [1, 2], has been shown to have advantages in unsupervised learning over previous topic models, while at the same time providing a natural representation for visualization and user interface design [?]. CG models are generative models based on a grid of word distributions, which can best be thought of as the grounds for a massive Venn diagram of documents. The intersections among multiple documents (bags of words) create little intersection units with a very small number of words in them (or rather, a very sparse distribution over the words). The grid arrangement of these sparse distributions, which we will refer to here as microtopics, facilitates fast cumulative-sum-based inference and learning algorithms that chop up the documents into much smaller constitutive pieces than traditional topic models typically do. For example, Fig. 1 shows a small part of such a grid with a few representative words with greatest probability from each microtopic. Each of the Science magazine abstracts used to train this grid is assumed to have been generated from a group of microtopics found in a single 4x4 window, with equal weight given to all component microtopics. Thus, each microtopic can be 16 times sparser than the set of documents grouped into the window.

A document may share a window with another very similar document, but it is also mapped so that it only partially overlaps with a window that is the source for a set of slightly less related documents. The varying window overlap literally results in a varying overlap in document themes. This modeling assumption results in a trained grid where nearby microtopics tend to be related to each other, as they are often used together to generate a document. Consider, e.g., the lower right 4x4 window in Fig. 1. The word distributions in these 16 cells are such that a variety of Science papers on evidence of ancient life on Earth could be generated by sampling words from there. (Note that each cell, though of very low entropy, contains a distribution over the entire vocabulary.) In the posterior distribution, this window is by far the most likely source for an article on a bizarre microorganism that produced nitrogen in Cretaceous oceans. In the 4x4 window two cells to the left of this example we find mapped a variety of articles on even more ancient events on Earth, e.g. on how sulfur isotopes reveal a deep mantle storage of ancient crust. But there we also start to see words which increase the fit for articles that describe similar events on other planets. Further movement to the left gets us away from the Earth and into astronomy.

To demonstrate the refinement of the microtopics compared to topics from a typical topic model, the color labeling of the grid was created so as to reflect the Kullback-Leibler (KL) divergence of the individual microtopics to the topics trained on the same data through latent Dirichlet allocation (LDA). The LDA topics, hand-labeled after unsupervised training, correspond to fairly broad topics, while the CG represents the data as a group of slowly evolving microtopics. For example, all the yellow-coded microtopics map to the "Physics" LDA topic, but they occupy a contiguous area in which, from left to right, the focus slowly shifts from electromagnetism and particle physics to material science. Furthermore, it is interesting to see the microtopics that occupy the boundaries between the coarser topics that the LDA model found, capturing the links among astronomy, physics and biology. It is immediately evident that 2D CGs can be of great use in data visualization, though the model can be trained for arbitrary dimensionality [?]. These models combine topic modeling and data embedding ideas in a way that facilitates intuitive regularization controls and allows creation of much larger sets of organized sparse topics. Furthermore, they lend themselves to elegant visualization and browsing strategies, and we encourage the reader to see the example http://research.microsoft.com/en-us/um/people/jojic/CGbrowser.zip.

However, the existing EM algorithm for CG learning is prone to local minima problems which occasionally lead to underperformance [?]. In addition, no direct testing of the microtopic coherence has been performed to date, which makes it unclear if they are meaningful outside their windowed grouping. After all, a variety of sophisticated topic models have been developed and tested by the research community, but LDA still seems to beat them often in practice. E.g., [16, 17] raise doubts that various reported perplexity improvements over the basic LDA model are meaningful, as they are sensitive to smoothing constants in the model and also fail to translate to improvements in human judgement of topic quality. In fact, LDA usually outperforms more complex models on tasks that involve human judgement, which may be the main reason why practitioners of data science prefer this basic model to others [?]. Here we develop hierarchical versions of CG models, which in our experiments produced embeddings of considerably higher quality. We show that layering into deeper architectures primarily aids in avoiding bad local minima, rather than increasing representational capacity: the trained hierarchical model can be collapsed into an original counting grid form, but with a much higher likelihood compared to the grids fit to the same data using EM with random restarts. The better data fit then translates into quantitatively better summaries of the data, as shown in numerical experiments as well as in human evaluations of microtopics obtained through crowdsourcing.

2 HIERARCHICAL LEARNING OF 
GRIDS OF MICROTOPICS 

Figure 2: a) The basic counting grid, b) the componential counting grid, c) the hierarchical counting grid model (HCG) obtained by stacking a componential counting grid and a counting grid, and d) the hierarchical componential counting grid model (HCCG). Dotted circles represent the parameters of the models. Red links represent known conditional distributions $P(k_n | \ell_n) = U^W_{\ell_n}(k_n)$ (Eq. 5). They are distributions over the grid locations, uniformly equal to $1/|W|$ in the window of size $W$ unequivocally identified by $\ell_n$.

The (C)CG grids [1, 2]: The basic counting grid $\pi_k$ [1] is a set of distributions on the d-dimensional toroidal discrete grid $E$ indexed by $k$. The grids in this paper are bi-dimensional and typically from $(E_x = 32) \times (E_y = 32)$ to $(E_x = 64) \times (E_y = 64)$ in size. The index $z$ indexes a particular word in the vocabulary, $z = [1 \ldots Z]$. Thus, $\pi_i(z)$ is the probability of the word $z$ at the d-dimensional discrete location $i$, and $\sum_z \pi_i(z) = 1$ at every location on the grid. The model generates bags of words, each represented by a list of words $w = \{w_n\}_{n=1}^{N}$, with each word $w_n$ taking an integer value between 1 and $Z$. The modeling assumption in the basic CG model is that each bag is generated from the distributions in a single window $W$ of a preset size, e.g., $W_x = 5 \times W_y = 5$. A bag can be generated by first picking a window at a d-dimensional location $\ell$, denoted as $W_\ell$, then generating each of the $N$ words by sampling a location $k_n$ for a particular microtopic $\pi_{k_n}$ uniformly within the window, and finally by sampling from that microtopic. Because the conditional distribution $p(k_n | \ell)$ is a preset uniform distribution over the grid locations inside the window placed at location $\ell$, the variable $k_n$ can be summed out [?], and the generation can directly use the grouped histograms

$$h_i(z) = \frac{1}{|W|} \sum_{k \in W_i} \pi_k(z) \qquad (1)$$

where $|W|$ is the area of the window, e.g. 25 when 5x5 windows are used. In other words, the position of the window $\ell$ in the grid is a latent variable given which we can write the probability of the bag as

$$P(w | \ell) = \prod_n h_\ell(w_n) = \prod_n \Big( \frac{1}{|W|} \sum_{k \in W_\ell} \pi_k(w_n) \Big) \qquad (2)$$
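The grouped histograms and the bag probability for a single window can be illustrated with a short sketch. It is a minimal illustration under stated assumptions, not the authors' implementation: it uses direct summation instead of the cumulative-sum (summed-area table) trick mentioned in the text, it assumes that a window is identified by its top-left corner, and the names `window_average` and `bag_log_likelihood` are hypothetical.

```python
import math

def window_average(pi, wy, wx):
    """Grouped histograms h: average of the microtopics pi over each toroidal window.

    pi[i][j] is a distribution (list of probabilities) over the vocabulary.
    Assumption: the window at (i, j) covers rows i..i+wy-1 and columns j..j+wx-1,
    wrapping around the grid edges.
    """
    ey, ex, nz = len(pi), len(pi[0]), len(pi[0][0])
    area = wy * wx
    h = [[[0.0] * nz for _ in range(ex)] for _ in range(ey)]
    for i in range(ey):
        for j in range(ex):
            for di in range(wy):
                for dj in range(wx):
                    cell = pi[(i + di) % ey][(j + dj) % ex]
                    for z in range(nz):
                        h[i][j][z] += cell[z] / area
    return h

def bag_log_likelihood(h_window, counts):
    """Log-probability of a bag given one window: sum_z c_z * log h(z)."""
    return sum(c * math.log(p) for c, p in zip(counts, h_window) if c > 0)
```

Because the grid is toroidal, every location yields a valid window, so `h` has exactly as many entries as `pi`. A summed-area table would reduce the cost per window from the window area to a constant number of lookups.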

Figure 1: Clash of topics: LDA topics are mapped onto a counting grid. As shown in the top left panel, LDA's topics cluster in contiguous areas on the grid. In the enlarged part of the grid, for each microtopic we show the most likely words if they exceed a threshold.

As the grid is toroidal, a window can start at any position and there are as many $h$ distributions as there are $\pi$ distributions. The former will have a considerably higher entropy as they are averages of many $\pi$ distributions. Although the basic CG model is essentially a simple mixture assuming the existence of a single source (one window) for all the features in one bag, it can have a very large number of (highly related) choices $h$ to choose from. Topic models [?], on the other hand, are admixtures that capture word co-occurrence statistics by using a much smaller number of topics that can be more freely combined to explain a single document. Componential Counting Grids (CCG) [2] combine these ideas, allowing multiple groups of broader topics $h$ to be mixed to explain a single document. The entropic $h$ distributions are still made of sparse microtopics $\pi$ in the same way as in CG, so that the CCG model can have a much larger number of topics than an LDA model without overtraining. More precisely, each word $w_n$ can be generated from a different window, placed at location $\ell_n$, but the choice of the window follows the same prior distribution $\theta_\ell$ for all words. Within the window at location $\ell_n$ the word comes from a particular grid location $k_n$, and from that grid distribution the word is assumed to have been generated. The probability of a bag is now

$$P(w | \pi) = \prod_n \sum_{\ell_n} \theta_{\ell_n} \cdot \Big( \frac{1}{|W|} \sum_{k_n \in W_{\ell_n}} \pi_{k_n}(w_n) \Big) \qquad (3)$$
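To make the difference between the mixture (CG) and the admixture (CCG) concrete, the following sketch evaluates the CCG bag probability in log form. It is only an illustration under assumptions: the window averages `h` are taken as given (one distribution per grid location, flattened into a list), `theta` is the per-document prior over window locations, and the function name is hypothetical.

```python
import math

def ccg_bag_log_likelihood(theta, h, words):
    """log P(w) = sum_n log( sum_l theta[l] * h[l][w_n] ).

    theta: prior over window locations for this bag (sums to 1).
    h:     h[l] is the window-averaged word distribution at location l.
    words: list of word indices in the bag.
    """
    total = 0.0
    for w in words:
        total += math.log(sum(t * h_l[w] for t, h_l in zip(theta, h)))
    return total
```

With `theta` concentrated on a single location the expression reduces to the single-window (CG) bag probability, which is one way to see the CG as a special case.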

In a well-fit CCG model, each data point has an inferred $\theta_\ell$ distribution that usually hits multiple places in the grid, while in a CG, each data point tends to have a rather peaky posterior location distribution because the model is a mixture. Both models can be learned efficiently using the EM algorithm because the inference of the hidden variables, as well as the updates of $\pi$ and $h$, can be performed using summed area tables [?], and are thus considerably faster than most of the sophisticated sampling procedures used to train other topic models. An intriguing property of these models is that even on a 32 x 32 grid with 1024 microtopics $\pi$ and just as many grouped topics $h$, there is no room for too many independent groups. With a window size of 8 x 8, for example, we can place only 16 windows without overlap, and the remaining windows overlap pieces of these 16. The ratio between grid and window size is referred to as the capacity of the model, and the training set size necessary to avoid overtraining the model only needs to be 1-2 orders of magnitude above the capacity number. Thus a grid of 1024 microtopics may very well be trainable with thousands of data points, rather than the hundreds of thousands that traditional topic models usually require for that many topics.
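The capacity arithmetic in this paragraph can be checked with a few lines. This is a minimal sketch that reads "ratio between grid and window size" as a ratio of areas, which matches the 32 x 32 grid and 8 x 8 window example; the function name is hypothetical.

```python
def capacity(grid, window):
    """Ratio of grid area to window area, i.e. how many windows tile the grid without overlap."""
    (ey, ex), (wy, wx) = grid, window
    return (ey * ex) / (wy * wx)

# The example from the text: a 32 x 32 grid with 8 x 8 windows.
c = capacity((32, 32), (8, 8))
# "1-2 orders of magnitude above the capacity" as a rough training-set range.
low, high = 10 * c, 100 * c
```

For this configuration the capacity is 16, so the rule of thumb suggests a training set on the order of a few hundred to a few thousand data points.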

Raw image embedding using (C)CGs: In previous applications of CG models to computer vision, images were represented as spatially disordered bags of features. We experimented with embedding raw images with full spatial information preserved, and we present this here as we feel that the image data helps in illuminating the benefits of hierarchical learning. An image described by a full intensity function $I(x, y)$ could be considered as a set of words, each word being an image location $z = (x, y)$. For an $N \times M$ image, we have a vocabulary of size $M \cdot N$. The number of repetitions of word $(x, y)$ is then set to be proportional to the intensity $I(x, y)$. (In the case of color images, the number of features is simply tripled, with each color channel treated in this way.) In other words, an unwrapped image is considered to be a word (location) histogram. The $\pi$ and $h$ distributions can then also be seen as images, as they provide weights for different image locations. If we tile the image representations of these distributions we get additional insight into CGs as an embedding method. Fig. 3 shows a portion of a 48 x 48 grid trained on 2000 MNIST digits assuming a 6 x 6 window averaging. To illustrate the generative model, in c) we show the partial window sums for two overlapping windows over $\pi$. The green and blue areas form a window that generates a version of digit 3, which can be seen at the top left of this portion of the $h$ grid (panel b)). The blue and red areas, on the other hand, combine into a window that represents a digit 2 at the position (3,3) in panel b). Partial sums for the green, blue and red areas are shown in c), and these partial sums, color coded and overlapped, are also illustrated in d). Careful observation of b), or of the full grid in the appendix, demonstrates the slow deformation of digits from one to another in the $h$ distributions. The appendix has additional examples of image dataset embedding, including rendered 3D head models and images of bald eagles retrieved by internet search. The CG $\pi$ distributions shown here look like little strokes, while the $h$ distributions are full digits. The CCG model, on the other hand, combines multiple $h$ distributions to represent a single image, and so $h$ looks like a grid of strokes (Fig. 4a), while the $\pi$ distributions are even sparser.

Figure 3: Intersecting digits on a grid of strokes. Each digit image is represented by counts (intensity) associated with image locations. a) $\pi$-distributions, b) $h$-distributions, c-d) intersecting digits.
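The image-as-bag representation described above can be sketched as follows. This is an illustration under assumptions rather than the authors' preprocessing: word indices are assigned in row-major order, intensities are turned into integer counts by rounding, and `image_to_counts` and `scale` are hypothetical names.

```python
def image_to_counts(image, scale=1.0):
    """Unwrap an N x M intensity image into counts over the N*M location 'words'.

    image: list of rows, each a list of non-negative intensities.
    scale: proportionality constant between intensity and word count (assumed).
    The word for pixel (row, col) has index row * M + col (row-major, assumed).
    """
    return [round(scale * value) for row in image for value in row]

# A tiny 2 x 2 "image": the vocabulary has 4 location words.
counts = image_to_counts([[0, 2], [3, 1]])
```

A color image would, under the same assumptions, be handled by concatenating the count vectors of the three channels, which triples the vocabulary size as stated in the text.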

Hierarchical grids: By learning a model in which microtopics join forces with their neighbors to explain the data, (C-)CG models tend to exhibit high degrees of relatedness of nearby topics. As we slowly move away from one microtopic, the meaning of the topics we go over gradually shifts to related narrowly defined topics, as illustrated by Fig. 1. This makes these grids attractive for HCI applications. But this also means that simple learning algorithms can be prone to local minima, as random initializations of the EM learning sometimes result in grouping certain related topics into large chunks, and sometimes in breaking these same chunks into multiple ones, with more potential for suboptimal microtopics along boundaries. To illustrate this, in Fig. 4a we show a 48 x 48 grid of strokes $h$ (Eq. 1) learned from 2000 MNIST digits using a CCG model assuming a 5 x 5 window averaging. Nearby features $h$ are highly related to each other, as they are the result of adding up features in overlapping windows over $\pi$ (which is not shown). CCG is an admixture model, and so each digit indexed by $t$ has a relatively rich posterior distribution $\theta^t$ over the locations in the grid that point to different strokes $h$. In Fig. 4b we show one of the main principal components of variation in $\theta$ as an image of the size of the grid. For three peaks there, we also show the $h$-features at those locations. The combination of these three sparse features creates a longer contiguous stroke, which indicates that this longer stroke is often found in the data. Thus, the separation of these features across three distant parts of the map is likely a result of a local minimum in basic EM training. To transfer this reasoning to text models, consider the 5th cell in the first row in Fig. 1, with the words HIV, AIDS, and the blue cell in the middle of the last column, with the words SELECTION, ADAPTIVE. The separation of these two things in faraway locations may very well be a result of a local minimum, which could be detected if location posteriors exhibit correlation. This illustration points to an idea on how to build better models. The distribution over locations $\ell$ that a data point $t$ maps to (a posteriori) could be considered a new representation of the data point (a digit in this case), with the mapped grid locations considered as features, and the posterior probabilities for these locations considered as feature counts. Thus another layer of a generative model can be added to generate the locations in the grid below (Fig. 2c-d). It is particularly useful to use another microtopic grid model as this added layer, because of the inherent relatedness of the nearby locations in the grid. The layer above can thus be either another admixture grid model (Componential Counting Grid, CCG) or a mixture (CG), and this layering can be continued to create a deep model. As CG is a mixture model, it terminates the layering: its posterior distributions are peaky and thus uncorrelated. However, an arbitrary number of CCGs can be stacked on top of each other in this manner, terminating on top with a CG layer to form a hierarchical CG (HCG) model, or terminating in a CCG layer to form a hierarchical CCG (HCCG) model. In each layer, the pointers to features below are grouped, which should result in creating a contiguous longer stroke, as discussed above, in a grid cell that contains a combination of pointers to the lower layers.


















































Figure 4: The benefits of hierarchical learning: a) $h_{CCG}$ (a bigger, higher-resolution version is in the appendix); b) principal components of $\theta$ and three peaks put together.


For the sake of brevity, we only derive the HCG learning algorithm with a single intermediate CCG layer. The extension to HCCG and higher order hierarchies is reported in the appendix. The variational inference and learning procedure for counting grid-based models utilizes cumulative sums and is only slower than training an individual (C)CG layer by a factor proportional to the number of layers. The graphical model for HCG is shown in Fig. 2c, where location variables pointing to grids in different layers have the same name, $\ell$, but carry a disambiguating superscript. To avoid superscripts in the equations below, we renamed the CG's location variable from $\ell^{(2)}$ to $m$ and dropped the superscript on the remaining location variable $\ell$. The bottom CCG layer follows

$$P(w_n | k_n, \pi_{CCG}) = \pi_{CCG,k_n}(w_n) \qquad (4)$$

$$P(k_n | \ell_n) = U^W_{\ell_n}(k_n) = \begin{cases} \frac{1}{|W|} & \text{if } k_n \in W_{\ell_n} \\ 0 & \text{otherwise} \end{cases} \qquad (5)$$

The latter is a pre-set distribution over the grid locations, uniform inside $W_{\ell_n}$. Instead of the prior $\theta_\ell$, the locations $\ell_n$ are generated from a top layer CG, indexed by $m$ ($\ell^{(2)}$ in the figure):

$$P(\ell_n | m, \pi_{CG}) = \frac{1}{|W|} \sum_{k \in W_m} \pi_{CG,k}(\ell_n) \qquad (6)$$
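The generative process defined by the three conditionals above can be sketched by ancestral sampling. This is a simplified illustration, not the authors' code: for brevity both grids are 1-D rings rather than 2-D tori, the top-layer window location `m` is drawn uniformly (the prior over `m` is an assumption here), and all function and parameter names are hypothetical.

```python
import random

def window(start, size, n):
    """Indices of the toroidal window of length `size` starting at `start` on a ring of n cells."""
    return [(start + d) % n for d in range(size)]

def sample_hcg_bag(pi_cg, pi_ccg, w_top, w_bottom, n_words, rng):
    """Ancestral sampling of one bag from a two-layer HCG on 1-D ring grids.

    pi_cg[k]  : top-layer microtopic, a distribution over bottom-grid locations.
    pi_ccg[k] : bottom-layer microtopic, a distribution over the vocabulary.
    """
    n_top, n_bottom = len(pi_cg), len(pi_ccg)
    m = rng.randrange(n_top)  # one top-layer window per bag (mixture); uniform prior assumed
    bag = []
    for _ in range(n_words):
        k_top = rng.choice(window(m, w_top, n_top))                    # uniform inside W_m
        loc = rng.choices(range(n_bottom), weights=pi_cg[k_top])[0]    # Eq. 6: location l_n
        k = rng.choice(window(loc, w_bottom, n_bottom))                # Eq. 5: uniform inside W_l
        word = rng.choices(range(len(pi_ccg[k])), weights=pi_ccg[k])[0]  # Eq. 4
        bag.append(word)
    return m, bag
```

Choosing `k_top` uniformly in the window and then sampling from `pi_cg[k_top]` is equivalent to sampling the location from the window average in Eq. 6, which is why the lower-level locations behave like observed "words" for the top layer.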


This equation also shows that the lower level's grid locations act as observations in the higher level. We use the fully factorized variational posterior $q^t(\{k_n\}, \{\ell_n\}, m) = q^t(m) \cdot \prod_n \big( q^t(k_n) \cdot q^t(\ell_n) \big)$ to write the negative free energy $\mathcal{F}$ bounding the non-constant part of the log-likelihood of the data as

$$\mathcal{F} = \sum_{t,n,k_n} q^t(k_n) \log \pi_{CCG,k_n}(w_n) + \sum_{t,n,k_n,\ell_n} q^t(k_n) q^t(\ell_n) \log U^W_{\ell_n}(k_n) + \sum_{t,m,\ell_n} q^t(m) q^t(\ell_n) \log h_{CG,m}(\ell_n)$$

where $h_{CG,m}(\ell_n) = P(\ell_n | m, \pi_{CG})$ is the window average of Eq. 6.

We maximize $\mathcal{F}$ with the EM algorithm, which iterates E- and M-steps until convergence.

E-step: [update equations for $q^t(k_n)$, $q^t(\ell_n)$ and $q^t(m = i)$; garbled in the source]

The M-step re-estimates the model parameters using these updated posteriors:

$$\pi_{CCG,i}(z) \propto \sum_t \sum_n q^t(k_n = i) \cdot [w^t_n = z]$$

$$\pi_{CG,i}(l) \propto \hat{\pi}_{CG,i}(l) \cdot \sum_t \sum_n q^t(\ell_n = l) \cdot \sum_{k | i \in W_k} \frac{q^t(m = k)}{\hat{h}_{CG,k}(l)}$$

where the last (CG) update is performed analogously to [?]. Interestingly, training these hierarchical models stage by stage is reminiscent of deep models, where such incremental learning was practically useful [?].

Although it has been shown that a deep neural network can be compressed into a shallow broader one through post training [?], the stacked (C-)CG models can be collapsed mathematically. In this sense we can view HCG and HCCG as hierarchical learning algorithms for CG and CCG, which are easier to visualize than deeper models. For example, for HCG in Fig. 2c-d, it is straightforward to see that the following grid defined over the original features $\{w_n\}$,

$$\pi_m(w_n) = \sum_i \pi_{CG,m}(i) \cdot h_{CCG,i}(w_n) \qquad (7)$$

can be used as a single layer grid that describes the same data distribution as the two-layer model^1. However, the grids estimated from the hierarchical models should be more compact, as the scattered groups of features are progressively merged in each new layer. Learning in hierarchical models is thus more gradual and results in better local maxima, and we show below that the results are far superior to regular EM learning of the collapsed CG or CCG models.

^1 $h_i$ are the grouped microtopics in the window $W_i$ (Eq. 1).
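The claim that a two-layer HCG collapses into a single grid can be checked numerically. The sketch below is only an illustration under assumptions: it uses 1-D ring grids for brevity, it assumes the collapsed microtopic at location m is the mixture of bottom-layer window averages weighted by the top-layer microtopic at m, and the names `window_mean` and `collapse` are hypothetical.

```python
def window_mean(dists, start, size):
    """Average of the distributions in the toroidal window of length `size` starting at `start`."""
    n, dim = len(dists), len(dists[0])
    return [sum(dists[(start + d) % n][z] for d in range(size)) / size for z in range(dim)]

def collapse(pi_cg, pi_ccg, w_bottom):
    """Collapsed grid: pi_m(w) = sum_i pi_cg[m][i] * h_ccg[i][w].

    pi_cg[m]  : top-layer microtopic, a distribution over bottom-grid locations i.
    pi_ccg[i] : bottom-layer microtopic, a distribution over words w.
    """
    h_ccg = [window_mean(pi_ccg, i, w_bottom) for i in range(len(pi_ccg))]
    vocab = len(pi_ccg[0])
    return [[sum(p * h_ccg[i][w] for i, p in enumerate(pi_m)) for w in range(vocab)]
            for pi_m in pi_cg]
```

Because window averaging is linear, averaging the collapsed grid over a top-layer window gives the same word distribution as marginalizing the intermediate locations in the two-layer model, while each collapsed cell remains a normalized distribution over words.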

3 EXPERIMENTS 

In all the experiments we used models with two extra lay¬ 
ers, although, in some experiments, we found that three 
levels worked slightly better. In general, the optimal num¬ 
ber of layers will depend on the particular application. 

Likelihood comparison: In the first experiment we compared the local maxima on models learned using the (full) MNIST data set. The two-layer HCG model was first pre-trained stage-wise as in, e.g., [?], by training the higher level on the posterior distribution from the lower level as the input. Then, the model was refined by further variational EM training. The procedure was repeated 20 times with different random initializations to produce twenty hierarchical models. As discussed above, these models can be collapsed to a CG model by integrating out intermediate layers (Eq. 7). These models were then compared with twenty models learned by directly learning CG models through the previously published standard EM learning algorithm, starting from twenty random initializations. Despite being collapsible to the same mathematical form, the HCG models consistently produced higher likelihood than the CG models directly learned using the standard method. In fact, each CG model created by collapsing one of the learned HCG models had log likelihood at least two standard deviations above the highest log
likelihood learned by basic EM (p-value < 10“^^). Both 
learning approaches used the computation time equivalent 
to 1000 iterations of standard EM, which was more than 
enough for convergence. 

Document classification: Next we ran a test to see if the increased likelihood obtainable with a better learning algorithm translates into increased quality of representation when posterior distributions for individual text documents are considered as features in classification tasks. We considered the 20-newsgroup dataset^2 (20N) and the Mastercook dataset^3 (MC), composed of 4000 recipes divided into 15 classes. Previous work in [?] reduced the 20-Newsgroup dataset into subsets with varying similarities, and we considered the hardest subset, composed of posts from the very similar newsgroups comp.os.ms-windows, comp.windows.x and comp.graphics. We considered the same complexities as in [?], using 10-fold cross validation, and classified test documents using maximum likelihood. Results for both datasets are shown in Tab. 1.

^2 http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.html

        CG       HCG      CCG      HCCG     linSVM
20N     82.3%    83.5%    83.4%    85.0%    77.5%
MC      38.7%    38.9%    76.2%    78.9%    71.3%

Table 1: Document classification. When bold, hierarchical grids outperformed the basic grids with statistical significance (HCG p-value = 2.01e-4, HCCG p-values < 1e-3). "linSVM" stands for linear support vector machines, which we report as a baseline.

Evaluation of microtopic quality using quantitative measures related to the use in visualization and indexing: We evaluated the coherence and the clarity of the microtopics, comparing the collapsed (2 layers) hierarchical grids, HCG and HCCG, with regular grids [1, 2], latent Dirichlet allocation (LDA) [7], the correlated topic model (CTM) [?], which allows learning a large set of correlated topics, and a few non-parametric topic models [?]. Generative models are often evaluated in terms of perplexity. However, different models, and even different learning algorithms applied to the same model, are very difficult to compare [16], and better perplexity does not always indicate better quality of topics as judged by human evaluators [17]. On the other hand, the subjective evaluation of topic quality is highly related to measures that have to do with data indexing, e.g. the quality of word combinations when used for information retrieval. Thus we start with a novel evaluation procedure for topic models which is strongly related to information indexing, and then show that we obtain similar evaluation results when we use human judgement. In the following experiments, we considered a corpus $\mathcal{D}$ composed of Science Magazine reports and scientific articles from the last 20 years. This is a very diverse corpus similar to the one used in [?]. As a preprocessing step, we removed stop-words and applied Porter's stemming algorithm [?]. We considered grids of size 16 x 16, 24 x 24, 32 x 32, 40 x 40 and 48 x 48, fixing the window size to 5 x 5. (Previous literature showed that counting grids are only sensitive to the ratio between grid and window area, as long as windows are sufficiently big.) We varied the number of topics for LDA and CTM in {10, 15, ..., 100, 125, 150, ..., 1000}. For each complexity we trained 5 models starting with different random initializations and we averaged the results. In each repetition, we considered a random third of this corpus, for a total of roughly $|\mathcal{D}|$ = 12K documents, Z = 20K different words and more than 600K tokens.

Figure 5: Microtopic evaluations. We compared 32 x 32 grids with the best result obtained by LDA and CTM. To avoid cluttering the graph, we did not report CCG results, which were found inferior to the proposed hierarchical models. We also reported the gradient of the diversity curves to show that new samples steadily continue to contribute new tuples.

To evaluate (micro)topics, we repeatedly sampled n-tuples of words and checked for consistency, diversity and clarity of the indexed content. In the following, we describe the procedure used for evaluating grids. An equivalent procedure was used to evaluate other topic models for comparison.

To pick a tuple $T$ of $n$ words, we sampled a grid location $\ell$. Then, we repeatedly sampled the microtopic $\pi_\ell$ to obtain the words in the tuple $T = \{w_1, \ldots, w_n\}$. We did not allow repetitions of words in the tuple. We considered 5000 different n-tuples for each of n = 2, 3, 4, 5, not allowing repeated tuples. Then we checked for consistency, diversity and clarity of the content indexed by each tuple. The consistency is quantified in terms of the average number of documents from the dataset that contained all words in $T$. The diversity of indexed content is illustrated through the cumulative graph of acquired unique documents as more and more n-tuples are sampled and used to retrieve documents containing them. As this last curve depends on the sample order, we further repeated the process 5 times, for a total of 25K different samples. Finally, the clarity [19] measures the ambiguity of a query with respect to a collection of documents, and it has been used to identify ineffective queries, on average, without relevance information.
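The consistency and diversity measures described above can be sketched in a few lines. This is an illustrative sketch, not the evaluation code used for the reported results: documents are assumed to be lists of (stemmed) word strings, a tuple "indexes" a document when every word of the tuple occurs in it, and the function names are hypothetical.

```python
def consistency(tuples, docs):
    """Average number of documents that contain all the words of a tuple."""
    doc_sets = [set(d) for d in docs]
    hits = [sum(1 for d in doc_sets if set(t) <= d) for t in tuples]
    return sum(hits) / len(tuples)

def diversity_curve(tuples, docs):
    """Cumulative number of unique documents retrieved as tuples are used in the given order."""
    doc_sets = [set(d) for d in docs]
    seen, curve = set(), []
    for t in tuples:
        seen.update(i for i, d in enumerate(doc_sets) if set(t) <= d)
        curve.append(len(seen))
    return curve

docs = [["a", "b", "c"], ["a", "b"], ["c", "d"]]
tuples = [("a", "b"), ("c", "d"), ("a", "d")]
avg_hits = consistency(tuples, docs)
curve = diversity_curve(tuples, docs)
```

The diversity curve depends on the order in which tuples are processed, which is why the text repeats the sampling several times before reporting it.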

Formally, the query clarity is measured as the relative entropy between the n-tuple and the language model $P(w)$ (unigram distribution) as $\sum_w P(w|T) \cdot \log_2 \frac{P(w|T)}{P(w)}$, where $P(w|T) = \sum_{d} P(w|d) P(d|T)$. We estimated the likelihood of an individual document model generating the tuple, $P(T|d) = \prod_{w_t \in T} P(w_t|d)$, and obtained $P(d|T)$ using uniform prior probabilities for documents that contain a word in the tuple, and a zero prior for the rest. Finally, to estimate $P(w|T)$ we employed Monte Carlo sampling.
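A small-scale version of the clarity score can be computed exactly, which may help in reading the definition. The sketch below rests on assumptions that are not in the text: document models are unigram estimates linearly smoothed with the collection model (weight `lam`, a hypothetical parameter), the sum over documents is evaluated exactly instead of by Monte Carlo sampling, and every tuple word is assumed to occur somewhere in the corpus.

```python
import math
from collections import Counter

def clarity(tuple_words, docs, lam=0.6):
    """Relative entropy (in bits) between P(w|T) and the collection model P(w)."""
    vocab = sorted({w for d in docs for w in d})
    total = sum(len(d) for d in docs)
    coll = Counter(w for d in docs for w in d)
    p_w = {w: coll[w] / total for w in vocab}          # collection (unigram) model
    counts = [Counter(d) for d in docs]

    def p_w_d(w, c, length):
        # Smoothed document model (smoothing scheme is an assumption).
        return lam * c[w] / length + (1 - lam) * p_w[w]

    # P(d|T) proportional to prior(d) * P(T|d); prior is uniform over documents
    # containing at least one tuple word and zero otherwise.
    post = []
    for d, c in zip(docs, counts):
        prior = 1.0 if any(w in c for w in tuple_words) else 0.0
        post.append(prior * math.prod(p_w_d(w, c, len(d)) for w in tuple_words))
    norm = sum(post)
    post = [p / norm for p in post]

    score = 0.0
    for w in vocab:
        p_wt = sum(pd * p_w_d(w, c, len(d)) for pd, d, c in zip(post, docs, counts))
        if p_wt > 0:
            score += p_wt * math.log2(p_wt / p_w[w])
    return score
```

A tuple that retrieves documents whose word usage matches the whole collection scores close to zero, while a tuple that isolates a topically distinct subset scores higher.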
Results are illustrated in Fig. 5 and should be appreciated by looking at all three measures together, as some can be over-optimized at the expense of others. The diversity curve that consistently grows as more tuples are sampled indicates that the sampled tuples belong to different subsets of the data, and are thus discriminative in segmenting the data into different clusters. The average tuple consistency, on the other hand, demonstrates that the sampled tuples do occur in large chunks of the data, demonstrating that the induced clusters are of significant size. The clarity measure shows that the clusters made of texts retrieved using different tuples have clear differentiation from the rest of the dataset in the usage of all the words in the dictionary. We report results for the 32 x 32 grids and the best results of LDA and CTM, which peaked at 80 and 60 topics respectively. Results for other grid sizes can be found in the additional material; they are stable across complexities, with slightly better performance for larger grids.

All grid models show good consistency of the words selected, as they are optimized so that documents' words map into overlapping windows. Through positioning and intersection of many related documents, the words end up being arranged in a fine-grained manner so as to reflect their higher-order co-occurrence statistics. Hierarchical learning greatly improved the results despite the fact that HCCG and HCG can be reduced to (C)CGs through marginalization (Eq. 7).

Overall, HCCG strongly outperformed all the other methods, especially with a total gain of 0.5 bits on clarity, which is around a third of the score for LDA/CTM. Despite allowing for correlated topics that enable CTM to learn larger topic models, CTM trails LDA in these graphs, as topics were over-expanded. We also considered non-parametric topic models such as "Dilan" [?] and the hierarchical Dirichlet process [?], but their best results were poor and we did not report them in the figure. To give an idea, both models only indexed 25% of the content after 5000 2-tuple samples and had a clarity 0.7-1.2 bits lower than other topic models.

Human judgments of topic coherence: We next tested 
the quality of the inferred topics. Topic coherence is of¬ 
ten measured based on co-occurrence of the top k = 10 
words per topic. While good as a quick sanity check of a 
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Figure 6: Result of word intrusion task. Statistical significance is denoted by *. p-values and further details on the test 
are reported in the appendix 


single learned model, when this measure is used to com¬ 
pare models, it will favor models that lock onto top themes 
and distribute the rest of the words in the tails of the topic 
distributions. The LDA models usually have a large drop 
off in topic coherence when the number of topics is in¬ 
creased to force the model to address more correlations in 
the data. Indeed, using this measure, LDA topics outperform
CG topics in the case of small models. But as the number
of topics grows, the microtopics trained by HCG significantly
outperform both LDA and CG (see the appendix). A
more interesting measure of topic quality, which depends
not only on individual topic coherence but also on meaningful
separation of different topics, requires human evaluation
of word intrusions. In a word intrusion task [?], six
randomly ordered words are presented to a human subject 
who then guesses which word is an outlier. In the original 
procedure a target topic is randomly selected and then the 
five words with highest probability are picked. Then, an 
intruder is added to this set. It is selected at random from 
the low probability words of the target topic that have high 
probability in some other topic. Finally the six words are 
shuffled and presented to the subject. If the target topic 
shows a lack of coherence or distinction from the intrud¬ 
ing topic, the subject will often fail to correctly identify 
the intruder. This task is again geared towards getting only
the top words right in a topic model and ignoring the rest
of the distribution, which makes it unsuitable for comparison
with microtopic models, which attempt to extract much
more correlation from the data. Thus, instead of picking
the top words from each topic, we sampled the words from
the target topic to create the in-group. After sampling the
location I of a microtopic from the grid, we picked three
randomly chosen words from that microtopic, or from the
small groups of microtopics in the windows of size 2 x 2
and 3 x 3 around I. (The latter is equivalent to computing
the window distributions h using windows of smaller size
than the ones used in training, and should indicate whether
the granularity assumed in the window size was exaggerated:
if it is, then averaging of nearby topics should significantly
reduce the noise due to forced topic splitting.) For each of
these groups we chose the intruder word using the standard
procedure. If in this harder task humans can identify
intruders better for microtopic models than for LDA models,
this would indicate that the microtopics are not simply
random subsamples of broader topics captured in h and
similar in entropy to LDA topics. They would be a meaningful
breakup of broad topics into finer ones. We compared
LDA (known to perform better than CTM on intrusion
tasks [?]), HCG, and HCCG on 10K randomly crawled
Wikipedia articles and used Amazon Mechanical Turk
(24000 completed tasks from 345 different people). The
trained grids were of size 32 x 32 and the windows 5 x 5.
The optimal LDA size was chosen using likelihood
cross-validation over the range of complexities, as in the previous
experiments (the peak performance there was at 80 topics).
Results are shown in Fig. 6 as a function of the Euclidean
distance on the grid of the intruder word from the
topic. HCCG outperformed LDA (p-values for the 3 tasks:
1.20e-11, 1.88e-5, 2.97e-05) and HCG (p-values for the 3
tasks: 3.97e-18, 1.01e-11, 3.14e-19), indicating that learning
microtopics is possible with a good algorithm. Overall,
users were able to correctly solve 71% of HCCG problems
and only 58% of LDA problems. Interestingly, the performance
of HCCG and HCG does not seem to depend on the
distance of the intruder word: even picking the intruder word
from a very close location rather than from a faraway one
led to no additional confusion for the user. This shows
that HCCG chops up the data into meaningful microtopics,
which are then combined into a large number of groups
h that do not over-broaden the scope. HCCG and HCG
also outperformed CG and CCG, respectively (see the
appendix).
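
The modified intrusion procedure described above can be sketched as follows. This is only an illustrative reading of the text, not the code used in the study: the array layout `pi` (rows x columns x vocabulary), the wrap-around window anchored at its top-left cell, and the thresholds `low_frac` and `top_n` that make "low probability" and "high probability" concrete are all assumptions introduced here.

```python
import numpy as np

def window_cells(shape, loc, w):
    """Row and column indices of a w x w window at loc, wrapping at the edges."""
    rows, cols = shape
    r = (loc[0] + np.arange(w)) % rows
    c = (loc[1] + np.arange(w)) % cols
    return r, c

def window_distribution(pi, loc, w):
    """Average the microtopics inside the window: a small-window version of h."""
    r, c = window_cells(pi.shape[:2], loc, w)
    return pi[np.ix_(r, c)].mean(axis=(0, 1))

def make_intrusion_question(pi, loc, w, rng, n_in=3, low_frac=0.5, top_n=10):
    """Return (shuffled word ids, intruder id, grid cell the intruder came from)."""
    rows, cols, vocab = pi.shape
    target = window_distribution(pi, loc, w)

    # In-group: words are sampled from the target distribution, not top-ranked.
    in_group = rng.choice(vocab, size=n_in, replace=False, p=target)
    in_set = set(in_group.tolist())

    # Words with low probability under the target (assumed: bottom low_frac).
    low = set(np.argsort(target)[: int(low_frac * vocab)].tolist())

    # Flat indices of the cells that belong to the target window.
    r, c = window_cells((rows, cols), loc, w)
    inside = set((r[:, None] * cols + c[None, :]).ravel().tolist())

    flat = pi.reshape(-1, vocab)
    for k in rng.permutation(rows * cols).tolist():
        if k in inside:
            continue
        # Words with high probability in this other microtopic (assumed: top_n).
        top = np.argsort(flat[k])[-top_n:].tolist()
        candidates = [v for v in top if v in low and v not in in_set]
        if candidates:
            intruder = int(rng.choice(candidates))
            break
    else:
        raise ValueError("no admissible intruder word found")

    # Shuffle the in-group and the intruder before presenting them.
    words = in_group.tolist() + [intruder]
    order = rng.permutation(len(words)).tolist()
    return [words[i] for i in order], intruder, (k // cols, k % cols)
```

Calling `make_intrusion_question` with `w` set to 1, 2 and 3 would correspond to the three tasks; the returned source cell makes it possible to bin answers by the grid distance between the intruder and the target location.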

Figure 7: Unsupervised learning on mixed digits

Learning to separate mixed digits. Finally, we show
that an HCCG model can be used to perform a task that
eludes most unsupervised and supervised models. We created
a set of 10000 28 x 28 images, each containing two
different overlapped MNIST digits (Fig. 7). We trained an
HCCG model consisting of five 32 x 32 layers on this data
stagewise by feeding L^{(l)} from one layer to the next.
Windows of size 5 x 5 were used in all
layers. From layer to layer, the new representations of the
image consist of growing combinations of low-level features
h from the bottom layer (the sparseness of which is similar
to Fig. 4). The hierarchical grouping is further encouraged
by simply smoothing L^{(l)} with a 5 x 5 Gaussian kernel
with standard deviation of 0.75 before feeding it to the next
layer. (This is motivated by the fact that nearby features in h
are related, and so if two distant locations should be grouped,
so should those locations’ neighbors.) Once the model is
collapsed to a single HCCG grid, the components no longer
look like short strokes but like whole digits, mostly free of
overlap: the model has learned to approximately separate
the images into constituent digits. Reasoning on overlapping
digits eludes even deep neural networks trained in a
supervised manner, but here we did not use the information
about which two digits are present in each of the training
images.
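
The smoothing step between layers can be illustrated with a short sketch. It assumes that the quantity passed upward is a single 32 x 32 nonnegative map per image and that the grid wraps around at its edges; neither detail is spelled out in this section, and the function names are hypothetical.

```python
import numpy as np

def gaussian_kernel(size=5, sigma=0.75):
    """Normalized size x size Gaussian kernel (5 x 5, deviation 0.75 as in the text)."""
    ax = np.arange(size) - (size - 1) / 2.0
    g = np.exp(-0.5 * (ax / sigma) ** 2)
    k = np.outer(g, g)
    return k / k.sum()

def smooth_on_torus(grid_map, kernel):
    """Convolve a 2-D map with the kernel, wrapping around the grid edges (assumed)."""
    half_r = kernel.shape[0] // 2
    half_c = kernel.shape[1] // 2
    out = np.zeros_like(grid_map, dtype=float)
    for dy in range(kernel.shape[0]):
        for dx in range(kernel.shape[1]):
            shifted = np.roll(grid_map, (dy - half_r, dx - half_c), axis=(0, 1))
            out += kernel[dy, dx] * shifted
    return out
```

Because the kernel sums to one and the shifts wrap around, the total mass of the map is preserved, so a normalized map stays normalized when it is handed to the next layer.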

4 CONCLUSIONS 

We show that with new learning algorithms based on a hier¬ 
archy of CCG models, possibly terminated on the top with 
a CG, it is possible to learn large grids of sparse related 
microtopics from relatively small datasets. These micro¬ 
topics correspond to intersections of multiple documents, 
and are considerably narrower than what traditional topic 
models can achieve without overtraining on the same data. 
Yet, these microtopics are well formed, as both the numerical
measures of consistency, diversity and clarity and the
user study with 345 Mechanical Turk workers show. Another
approach to capturing sparse intersections of broader topics is
through product-of-experts models, e.g., RBMs [?], which
consist of relatively broad topics but model the data through
intersections rather than admixing. RBMs are also often
stacked into deep structures. In future work it would be
interesting to compare these models, though the tasks we
used here would have to be changed somewhat to focus on
the intersection modeling rather than the topic coherence
(as this is not what RBM topics are optimized for). HCCG
and HCG models have a clear advantage in that it is easy to 
visualize how the data is represented, which is useful both 
to end users in HCI applications, and to machine learning 
experts during model development and debugging. An¬ 
other parallel between the stacks of CCGs and other deep 
models is that the uniform connectivity of units is directly 
enforced through window constraints, rather than encour¬ 
aged by dropout. Finally, in this specific context we illus¬ 
trate a broader phenomenon that requires more methodical 
and broader treatment by the machine learning community. 
A more complex (deeper) model here showed large advantages
in terms of training likelihood, but these advantages
were not due to the expanded parameter space, because the
resulting model is equivalent to a collapsed single-layer
model. Rather than being a reflection of increased representational
abilities of the model, better likelihoods were
thus the result of a better fitting algorithm, which consists of
training a deep model (and then collapsing it into a simpler
but equivalent parameterization). Similar phenomena are
likely regularly encountered elsewhere in machine learn¬ 
ing, but not always recognized as such, as in the absence of 
the full knowledge of the extrema of the fitting criterion, an 
increase in performance is often inappropriately ascribed to 
better modeling rather than better model fitting. 